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Abstract 

An improved neighbor list algorithm is proposed to reduce unnecessary interatomic 
distance calculations in molecular simulations. It combines the advantages of Verlet 
table and cell linked list algorithms by using cell decomposition approach to accel- 
erate the neighbor list construction speed, and data sorting method to lower the 
CPU data cache miss rate, as well as partial updating method to minimize the un- 
necessary reconstruction of the neighbor list. Both serial and parallel performance 
of molecular dynamics simulation are evaluated using the proposed algorithm and 
compared with those using conventional Verlet table and cell linked list algorithms. 
Results show that the new algorithm outperforms the conventional algorithms by a 
factor of 2 3 in cases of both small and large number of atoms. 
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1 Introduction 



Some molecular simulation techniques such as molecular dynamics and Monte 
Carlo method are widely used to study the physical properties and chemi- 
cal processes which contain a large number of atoms in statistical physics, 
computational chemistry, and molecular biology [1]. All these methods in- 
volve evaluation of the total interatomic potential energy Vtot of N atoms and 
its gradients. The potential energy contains various interatomic interactions 
in the physical system, and is usually the function of internal coordinates of 
atoms. For example the potential energy of liquids and gases is often described 
as a sum of two-body (or pairwise) interactions over all atom pairs. A com- 
mon choice of two-body interatomic interaction expression is Lennard- Jones 
potential function, which is a simple function of the distance rij between atom 
i and j, and is shown as follows. 



The total potential energy is a sum of two-body interactions over all atom 
pairs. 



In molecular dynamics simulation, evaluation of Eq.(2) and its gradient usually 
costs most of CPU time. Apparently, direct calculation of Eq.(2) requires N'^ 
steps. If we use Newton's third law, the total calculation steps can be decreased 
to N{N -~l)/2. Obviously it is formidable to carry out such a calculation when 
there are many atoms in the system, and some methods are strongly needed 
to reduce the redundancy in evaluation of Eq. (2) . 

Firstly a cutoff distance Tcut is introduced in potential functions, and both 
potential functions and their gradients beyond the cutoff distance are assumed 
to be zero. This treatment can reduce the computing time greatly by neglecting 
all atoms beyond the cutoff distance, since interactions between these atoms 
are zero and needn't to be considered. However, straightforward determination 
of which atoms are within cutoff distance needs to evaluate all interatomic 
distances over all atom pairs, and this procedure scales 0{N^) as the system 
size. 

Effective reduction of unnecessary interatomic distance evaluation can be ac- 
complished by Verlct table algorithm [2] and cell linked list algorithm [3]. 
However, there is a tradeoff between overhead for maintaining neighbor list 
table and reduction of unnecessary interatomic distance calculation. 




(1) 
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Verlet table algorithm and cell linked list algorithm have been intensively stud- 
ied and have shown significant reduction in total computing time [4,5,6,7,8,9]. 
Glikman et al proposed an alternative Verlet table method called relational 
method which updates the neighbor list recursively and is called in every 
timestep [10], Eisenhauer considered the locality of the neighbor list and used 
a simple slab-based decomposition in his work [11]. In 1994. Hansen and Evans 
employed cell method in conventional Verlet table algorithm to accelerate the 
neighbor list construction speed [12], and Frenkel also considered combination 
of two conventional neighbor list algorithms in his book [13]. In our work im- 
proved neighbor list algorithm is proposed and the overhead to maintain the 
neighbor list table has been reduced to order 0(N). The details and bench- 
mark of this algorithm are given in the paper. 



2 Conventional neighbor list algorithms and the improved algo- 
rithm 

For the details of conventional Verlet table and cell linked list algorithms one 
can refer to Allen and Tildesley's book [14], and here we only mention some 
fundamental and important facts. Some graphs in this section are drawn in 
2D for the convenience of illustration, however, all related discussions can be 
easily generalized to 3D systems. 

2.1 Conventional Verlet table algorithm 

The basic idea of Verlet table algorithm is to construct and maintain a list 
of neighboring atoms for each atom in the system [2,14]. During the simu- 
lation, the neighbor list will be updated periodically in a fixed interval, or 
automatically when the displacements of some atoms are larger than a cer- 
tain value [15]. In this algorithm the potential CTitoff sphere with radius Tcut 
is surrounded by a "skin" r^, to give a larger sphere with radius + [14]. 
When constructing a neighbor fist for an atom i, another atom j is considered 
as a "neighbor" if the distance between them rij < Tcut + rg. It should be 
noticed that the "skin" should be large enough, so that between two times 
of neighbor list reconstruction no atom can penetrate through the skin into 
the cutoff sphere of another atom if it is not in the neighbor list of that atom. 

Conventionally in this algorithm, interatomic distances of all atom pairs need 
to be calculated, so the total CPU time scales 0{N'^) as the number of atoms. 
However, evaluation of the total potential energy and its gradient using the 
neighbor list is efficient because only atoms appearing in the list, i.e., those in 
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Fig. 1. Illustration of conventional Verlet table algorithm 

the sphere of rcut+^'s, are checked, thus the overall procedure scales O(A^-A^nl), 
where A^nl is the average number of neighbors of an atom in the material and 
is independent to the system size N. 

Verlet table algorithm has been proven to be efficient when the number of 
atoms is relatively small and the atoms move slowly. Its main advantage is 
the high efficiency of potential/force evaluation using the neighbor hst, as the 
atoms in the list are only a few more than actual needed (which rcut < f^ij < 
fcut + ^s)- On the other hand, its main drawback is that construction of the 
neighbor list scales 0{N^) (needs A^(A^ — l)/2 steps to build the list for all 
atoms). Moreover, with the increase of atoms' mobility, either the "skin" or 
the frequency of reconstructing the neighbor list must increase. Both of them 
make the overall simulation time increases dramatically. 

2.2 Conventional cell linked list algorithm 

The cell linked list algorithm is effective when the number of atoms is large. In 
this algorithm, the simulation domain is partitioned into several cells, and the 
edge of cells is equal to or larger than rcut- All atoms are assigned to these cells 
by their positions, at the same time a linked list of the atom indices is created. 
At the beginning of a simulation, an array containing a list of cell neighbors for 
each cell is created, and this list remains fixed unless the simulation domain 
changes during the simulation [14]. 

A cell m is considered as a neighbor of another cell n, if there are a point at 
Ti in cell m and another point at r.j in cell n such that |ri — r^l < rcut- Since in 
the conventional method the edge of the cell is equal to or larger than rcut, and 
considering the periodic boundary condition, there are 8 (for 2D systems) or 
26 (for 3D systems) neighbors for each cell. Thus the neighbors of each atom 
can be found in the cell where the atom located and the neighboring cells. 

The construction of "neighbor list", i.e., assigning each atom to appropriate 
cells, scales 0{N), but we can see that a big number of atoms need to be 
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checked in the potential/force evaluation procedure, and this is rather ineffi- 
cient compared to Verlet table algorithm. A common choice of the cell edge 
is the potential cutoff distance Vcut, thus for each atom, all atoms in 27 cells, 
or in the volume of 27r^^^, will be checked in the potential/force evaluation 
procedure^ . Ideally, only atoms in the volume of fTrr^^^ ^ 4.189 r^^^ fall in 
the cutoff distance, and accordingly need to be checked. 

However, if a small cell edge is used, volume containing the atoms need to be 
checked will be dramatically reduced. For example if the cell edge is |rcut, the 
volume will be (2.5)^ r^^t, only 57.87% of the previous volume, as Fig. 2 shows 
for 2D systems. 




Fig. 2. Choice of the length of cell edge in the cell linked list algorithm: rcut vs 2''cut- 



Furthermore, the cell edge can be very small so that only one atom can be 
contained in the cell, as described in Ref. [14]. This method is reported to has 
higher performance than conventional algorithm. However, according to our 
tests, it is still not as fast as Verlet table algorithm when the number of atoms 
is relatively small. 

From above discussions, we can see that the advantage of the cell linked list 
algorithm is the fast and efficient construction of the "neighbor list" , but its 
disadvantage is that too many atoms need to be checked in the potential/force 
evaluation procedure, and improvements to this algorithm are not competitive 
to Verlet table algorithm with the increasing of implementation complexity 
when the number of atoms is relatively small. 



^ If Newton's 3rd law is used and the interatomic interaction is described by a 
two-body potential function, only half of neighboring cells needs to be searched, 
in other words, only 14 cells arc needed for 3D systems. However, if multiple-body 
potential functions are adopted, especially those using bond order terms, we can- 
not safely do this. For example in Brenner potential, calculation of N^"'-' = 1 -|- 

[J:1%':T mr^k)F{X,kr + IT.rjI'rf^li^^O^i^Jl)^'^ determination of whether 
a bond is part of a conjugated system or not, needs to reference all neighbors of 
every atom. In this case, searching only half of the neighboring cells is not correct. 
Therefore, in general cases we say all 27 cells need to be searched. 
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2.3 The improved neighbor list algorithm 

Verlet table algorithm will scale linearly as the number of atoms if cell decom- 
position approach is adopted. Together with data sorting and partial updating 
method, we get an improved neighbor list algorithm with high performance. 

2.3.1 Cell decomposition approach 

As cell decomposition approach has been introduced and used previously 
[13,12], here we only give its main features. In this approach two conven- 
tional algorithms are combined together. Firstly the simulation domain is 
partitioned into several cells, and then each atom is assigned to these cells by 
their positions, in the following the neighbor list is constructed by searching 
only in neighboring cells, instead of checking all atom pairs in the system. It 
can be seen that the construction of the neighbor list scales 0{c- N), where c 
is a constant and independent to N. For a system containing more than 1000 
atoms, this approach improves the overall performance significantly. 

For systems with high atomic mobility, we find that the cell edge of |rcut gives 
best performance after carrying out many test runs on some different kinds of 
computers. 

2.3.2 Partial updating of the neighbor list 

In certain systems where a small number of atoms have high mobility and 
most others only oscillate at equilibrium positions, the choice of skin in Verlet 
table algorithm can be very tricky. Small value ensures that atoms which are 
stored in the list and beyond the cutoff distance (rcut < r < Tcut + ^s) are as 
less as possible, however, updating of neighbor list will become frequent. On 
the other hand, the potential/force evaluation efficiency is lowered when using 
large skin value, it can be tought to find a good value. 

Fortunately, this problem can be solved in the framework of cell decomposition 
approach. As searching for neighbors is confined to several cells, after the 
neighbor list is out-of-date and needs to be updated, reconstruction procedure 
can be carried out for atoms in relevant cells only, instead of for every atom 
in the system. 

In this method, each cell is accompanied with a "dirty fiag", which denotes 
if the neighbor list corresponding to this cell needs to be updated. At every 
timestep all atoms' accumulative displacements will be checked in each cell, 
if the sum of the largest two displacements is greater than r^, then the "dirty 
fiag" of this cell and neighboring cells will be turned on, and the neighbor list 
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construction procedure is carried out for these cells. After the construction 
procedure these flags are turned off. 

For aforementioned system this method effectively improves the overall perfor- 
mance because the CPU time on updating the neighbor list is minimized. In 
addition, the skin can be small enough so that the potential/force evaluation 
efficiency is maximized. 

It needs to be noticed that in this method the neighbor hsts of different atoms 
should be stored separately (e.g., the neighbor list array can be defined in 
the form of nlist [N] [MAXNL] , where N is the number of atoms and MAXNL is 
the maximum number of neighbors of an atom). This, however, may slightly 
increase the memory requirement although this is not serious in modern com- 
puters where the internal memory is at least hundreds of MB. 

2.3.3 Acceleration of data access by data sorting 

Considering pipeline architectures of modern CPUs, further effort can be taken 
to maximize the computing performance: sorting the storage sequences of 
atoms in the memory, thus the memory locations of atoms which in the same 
cell or neighboring cells are also close to each other, then the data can be 
loaded and cached in the CPU more efficiently. 




Fig. 3. Illustration of data sorting method. In this method, atoms which are close 

to each other in the simulation domain are also stored continuously. Every timestep 
after updating the neighbor list, data of all atoms are sorted to maximize the data 
cache hit rate. 

For a better understanding of this method, we can consider the simulation of 
gas and liquid materials which have high atomic mobility. In the beginning, 
after the molecular structure is initialized, the data of positions, velocities 
and accelerations are well ordered and stored in the memory continuously. 
Sometimes data of all neighboring atoms can be loaded into CPU data cache 
if original data is well organized. But when the simulation is going on and the 
atoms are moved here and there, memory locations of these data become more 
and more disordered, and the neighboring atoms can be seldom loaded in the 
same cache line. Then the CPU is hard to find the data of neighboring atoms in 
the data cache, and has to stall the execution and fetch required data into the 
data cache. However, irrelevant data along with fresh loaded data pollute the 
data cache, therefore the data cache miss rate can be high. This phenomena 
can be detected in a long time simulation, as an example shown in Fig. 4, 
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where a very significant performance degradation can be seen. 
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Fig. 4. Performance degradation in a gas simulation due to higii data cache miss 
rate. The performance is measured in the unit of "atom-step/second", which can be 
calculated by multiplying the number of atoms by the number of steps and divided 
by the elapsed CPU time. 

The solution of this problem is rather straightforward, i. e., sorting the data of 
atoms by their positions and making the memory locations of the neighboring 
atoms as close as possible. By adopting this method, no explicit performance 
degradation can be seen in long time simulations. The procedure to carry out 
the data sorting is shown in Algorithm 1. In this procedure, firstly we find the 
longest direction of the domain, and partition the domain into several layers 
along this direction, lastly the data of all atoms are stored layer by layer. 
Please note that a "weak" sorting is used, it means that data of atoms inside 
a layer may be disordered. It can be seen that "weak" data sorting procedure 
scales 0{N) as the number of atoms. 

Together with the cell decomposition approach in section 2.3.1, the data lo- 
cality is enhanced. Thus when parallelizing the simulation program on SMP 
platforms, the data can be easily and well partitioned, and the neighbor list 
construction can be carried out by each CPU in the computer simultaneously, 
thus a high parallelization execution can be achieved. 

In this work, overall procedure for constructing the neighbor list ^ is shown in 
Algorithm 2. Here it should be mentioned that data sorting procedure may 
not be carried out for low atomic mobility systems. 



^ Full neighbor list construction is carried out at the first timcstcp and at every 
timestep after the data sorting, and partial updating is performed between two 
times of data sorting. 
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Algorithm 1. Data sorting procedure in the improved neighbor list algorithm 

m <— direction of the longest edge of domain {m = 1,2,3} 
nlayer ^ number of layers in direction m 
for alH in 1 • ■ ■ do 
k <— layer index of atom i 

numlay{k) <— numlay{k) + 1 {number of atoms in layer k} 
end for 

{Find the index of first atom in each layer.} 

layer (1) ^ 1 

for alH in 1 ■ ■ ■ nlayer — 1 do 

layer{i + 1) ^ layer{i) + numlay{i) 
end for 

{Move data to temporary storage and do weak sorting.} 
for i = 1 to do 

k <— layer index of atom i 

j ^ layer{k) + numlay{k) — 1 

NR{j)^R{i),NV{j)^V{i) 

numlay{k) <— numlay{k) — 1 
end for 

R ^ NR, V ^NV 



Algorithm 2. The improved neighbor list algorithm 
{Assigning each atom into its appropriate cell} 
for alH in 1 • • • do 

k <— cell index of atom i 

append i into the list of cell k 
end for 

call sorting {Sorting atoms by their position. See Algorithm 1} 
for alH in 1 • • • iV do 
I <— cell index of atom i 
for all m e Z and neighbors of I do 
for all j G cell m and j ^ i do 
^ R{j) - R{i) 
apply the periodic boundary condition to rjj 
if \Yij\ < Tcut + Ts then 

append j into the neighbor list of atom i 
end if 
end for 
end for 
end for 
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3 Results and discussions 



A molecular dynamics simulation program using Lennard- Jones two-body po- 
tential function is developed to compare the performance of three different 
neighbor hst algorithms. The benchmarks are carried out on a Compaq Alpha 
Server DS20 with two EV67/667 MHz processors and Tru64 5.1A operating 
system installed, and the program is written in Fortran 90 and compiled by 
Compaq Fortran compiler V5. 5-1877. We also run the same benchmarks on a 
PC with one Intel Pentium III 866 MHz CPU and Red Hat Linux 8.0 installed, 
and a HP RX2600 workstation with two Intel ltanium2 900 MHz processors 
and Red Hat Linux Advanced Workstation installed. Three CPUs have dif- 
ferent architectures: Alpha EV67 is a typical RISC (Reduced Instruction Set 
Computing) CPU, Intel Pentium HI is a CISC (Complex Instruction Set Com- 
puting) CPU, while Itanium2 is declared as a brand new architecture named 
EPIC (Explicitly Parallelized Instruction Computing). The performance on 
three platforms are different, however the difference among three algorithms 
are qualitatively similar. All results presented in this section are based on the 
benchmark on Alpha Server DS20. 

In order to measure the performance of an algorithm quantitatively, a new 
unit named atom- step/second is defined. It can be simply calculated by multi- 
plying the number of atoms and number of timesteps divided by the number 
of CPU time elapsed in second. Larger value of this quantity stands for better 
performance. If computer hardware architectures are identical, and the simu- 
lation program scales 0{N) perfectly as the system size, then this quantity is 
proportional to the CPU performance. 

In our benchmark, firstly some Argon atoms are randomly placed in the do- 
main according to the predetermined density. Then the molecular dynamics 
simulation in canonical ensemble is performed, and the number of steps is 10^ 
for 10"^ atoms and more, or 10^ for 10^ ~ 10"^ atoms, or 10^ for 999 atoms or 
less. The temperature of the system is 300 K. 

The time integration algorithm is implementated by velocity Verlet scheme, 
and the canonical ensemble simulation is implemented by using Nose-Hoover 
thermostat [16]. Real physical units, instead of reduced units, are used. We 
collect the parameters in the simulation and show in Table 1. 

The simulation program is parallized using OpenMP technique [17]. Serial and 
parallel version of the program are compiled by turning on or off the relevant 
compihng options of OpenMP ^ . 

^ In Compaq Fortran compiler on Tru64/ Alpha option -omp enables OpenMP, 
in HP Fortran compiler on HP-UX/Itanium option +Oopenmp/+0noopeiimp en- 
ables/disables OpenMP, and in Intel Fortran compiler option -openmp enables 
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Parameter Value 



3.41 A 



e 



119.8 kJ/mol 



Mass 



40.0 atomic unit 



Density 



0.6 a 



-3 



Cutoff distance 2.5 a 



Skin 



0.5 0- 



Time step 



0.8 fs 



Table 1 

Parameters in the MD simulation 

In order to verify the improved neighbor list algorithm, all neighbor lists are 
dumped to disk files and compared with those in the execution with Ver- 
let table algorithm. In the verification stage, some statistical quantities, such 
as total potential energy, total kinetic energy, transient temperature of sys- 
tem and the trajectory of all atoms have been recorded in the interval of 10 
timesteps, and data generated from three algorithms are compared and ensure 
they differ in round-off errors only. Intensive tests show that the neighbor lists 
from the improved algorithm are exactly the same as those from the other two 
algorithms, and three different algorithms give same results precisely. 

Several simulations with different number of atoms and the same density are 
carried out, and the performance is calculated and recorded. 

Firstly we measure the neighbor list constructing time with different number 
of atoms for three algorithms, and results are shown in Fig. 5. In order to 
measure the CPU time accurately, a very large number of atoms are used. From 
the results we can see that the neighbor list constructing time of Verlet table 
algorithm scales 0(A^^) as the number of atoms, while the other two algorithms 
scale 0{N). We can also see that the neighbor hst constructing time of the 
improved algorithm is slightly larger than that of the cell linked list algorithm. 
However, the neighbor list needs to be constructed in every timcstep in the 
cell linked list algorithm, but it needs not in our improved algorithm, thus 
the CPU time used to maintain the neighbor hst in the improved algorithm 
is much less than those in the other two algorithms. 

Then we carry out the real benchmark and calculate the overall performance 
of the simulation with different algorithms. Results are shown in Fig. 6 (single 
processor results) and Fig. 7 (dual-processor results) 

From the results of Fig. 6, we can know that the proposed improved algorithm 



OpenMP. 
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10000 100000 1e+06 

Number of atoms 



Fig. 5. The neighbor list constructing time of three algorithms. Three curves from 
top to bottom denote Verlet table algorithm, the improved algorithm and the cell 
linked list algorithm, respectively. 
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1000 10000 100000 1e+06 

Number of atoms 

Fig. 6. Comparison of three algorithms on a single processor system. The perfor- 
mance is measured in the unit of "atom-step/second", which can be calculated by 
multiplying the number of atoms by the number of steps and divided by the elapsed 
CPU time. The three curves from top to bottom denote the performance of the 
improved algorithm, Verlet table algorithm and the cell linked list algorithm, re- 
spectively. 




1000 10000 100000 1e+06 

Number of atoms 

Fig. 7. Comparison of three algorithms on a dual-processor system. The three curves 
from top to bottom denote the performance of the improved algorithm, Verlet table 
algorithm and the cell linked list algorithm, respectively. 
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improves the overall performance of the molecular dynamics simulation signif- 
icantly. When the system is small, the performance of the improved algorithm 
is as high as that of Verlet table algorithm. As the system size increases and 
there are more and more atoms, the performance of Verlet table algorithm 
decreases rapidly, but the performance of the improved algorithm becomes 
even better. As the system is as big as 7 x 10^ atoms, the performance of 
Verlet table algorithm becomes very bad, and is much lower than that of the 
improved algorithm. On the other hand, although the cell linked list algo- 
rithm can handle a very large system due to its small memory requirement, 
its performance is much lower than that of the improved algorithm. For large 
systems, the improved algorithm in our work is about 2 ~ 3 times faster than 
conventional neighbor list algorithms. 

Prom Fig. 7 we can see that the improved algorithm in SMP platforms exhibits 
higher performance than the other two algorithms. The conclusions in above 
discussions for single-processor are also valid for dual-processor. 



4 Conclusions 

Nowadays as the continuously increasing of computing power, huge and com- 
plicated molecular simulations will be attempted and involve a larger number 
of atoms and more complex potential functions. The expectation of running 
molecular simulations faster and easier for larger systems on existing platforms 
makes it important to improve the neighbor list algorithm in order to reduce 
the unnecessary interatomic distance calculations. In this paper, we proposed 
an improved order 0{N) neighbor list algorithm which incorporates the ad- 
vantages of conventional Verlet table and cell linked list algorithms. In the new 
algorithm, cell decomposition approach is adopted to accelerate the neighbor 
list construction speed, and data sorting method is used to maximize the 
CPU data cache hit rate. In addition, partial updating of the neighbor list is 
introduced to minimize unnecessary neighbor list reconstructions. We carried 
out the benchmarks to these algorithms using molecular dynamics simulations 
with Lennard- Jones potential function, and compared the performance of the 
improved algorithm with that of the other two algorithms. The results show 
that the improved algorithm outperforms conventional Verlet table and the 
cell linked list algorithms by a factor of 2 ~ 3 in single-processor and SMP 
platforms. 
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